Add deterministic code and snippet memory identity by hunterbastian · Pull Request #181 · XortexAI/XMem

hunterbastian · 2026-05-15T01:35:39Z

Summary

Implements deterministic identity metadata for code annotations and personal snippets so XMem can avoid re-judging exact code/snippet memories with an LLM.

Changes:

- add stable Pinecone metadata helpers for snippet identity, snippet search text, code annotation identity keys, and code annotation content hashes
- route code and snippet memory through deterministic judge paths using metadata lookups
- store snippet_hash, annotation_key, and annotation_hash in Pinecone metadata
- keep snippet code exact in metadata while embedding only the searchable description/language/tags text
- add regression coverage for repeated snippets across sessions and same-target code annotation updates

This addresses the edge case discussed in #141 where a user sends a snippet, then asks for the same snippet in another session. The normalized snippet_hash lets the judge no-op the duplicate without another model call.
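The split between stored code and embedded text can be pictured with a minimal sketch. It assumes SHA-256 and a simple normalization rule (unified line endings, trailing whitespace stripped); `normalize_code` and `build_snippet_record` are hypothetical names for illustration, not the helpers in src/schemas/code.py, which may differ.

```python
import hashlib


def normalize_code(code: str) -> str:
    """Unify line endings and strip trailing whitespace per line (assumed rule)."""
    lines = code.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def build_snippet_record(description: str, language: str, tags: list[str], code: str) -> dict:
    """Return the text to embed plus the metadata to store (hypothetical shape)."""
    # Only the searchable description/language/tags text is embedded.
    search_text = " ".join(part for part in [description, language, *tags] if part)
    # The identity hash depends on the code alone, not on how it was described.
    snippet_hash = hashlib.sha256(normalize_code(code).encode("utf-8")).hexdigest()
    return {
        "embed_text": search_text,
        # The code itself is kept exactly as received.
        "metadata": {"code": code, "language": language, "snippet_hash": snippet_hash},
    }
```

Under these assumptions, the same code sent in two sessions with different descriptions yields the same snippet_hash, which is what allows a metadata lookup to stand in for a model call.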

Verification

- python3 -m compileall src/schemas/code.py src/agents/judge.py src/pipelines/ingest.py src/pipelines/weaver.py tests/unit/test_schemas.py tests/test_deterministic_memory_layer.py
- uv run --extra dev pytest tests/unit/test_schemas.py tests/test_deterministic_memory_layer.py -> 12 passed
- uv run --extra dev pytest -> 44 passed
- uv run ruff check --select F401 src/schemas/code.py src/agents/judge.py src/pipelines/ingest.py src/pipelines/weaver.py tests/unit/test_schemas.py tests/test_deterministic_memory_layer.py
- git diff --check

/claim #141

gemini-code-assist

Code Review

This pull request implements deterministic judging for code annotations and snippets, centralizing parsing and metadata generation logic within src/schemas/code.py. The JudgeAgent is updated to handle these new domains by performing metadata lookups to decide between adding, updating, or skipping items. Feedback suggests optimizing the _deterministic_code and _deterministic_snippet methods by deduplicating incoming items and using asyncio.gather for parallel metadata lookups to improve performance and prevent redundant operations.

gemini-code-assist · 2026-05-15T01:38:02Z

+    async def _deterministic_code(
+        self, new_items: list, user_id: str,
+    ) -> JudgeResult:
+        operations: list[Operation] = []
+        for item in new_items:
+            content = str(item)
+            fields = code_annotation_fields_from_storage_content(content)
+            match = await self._lookup_metadata_match({
+                "user_id": user_id,
+                "domain": JudgeDomain.CODE.value,
+                "annotation_key": code_annotation_identity_key(fields),
+            })
+
+            if match is None:
+                operations.append(Operation(
+                    type=OperationType.ADD,
+                    content=content,
+                    reason="No code annotation with the same repo/target/type key.",
+                ))
+                continue
+
+            incoming_hash = code_annotation_content_hash(fields)
+            existing_hash = str((match.metadata or {}).get("annotation_hash", ""))
+            if incoming_hash == existing_hash:
+                operations.append(Operation(
+                    type=OperationType.NOOP,
+                    content=content,
+                    embedding_id=match.id,
+                    reason="Existing code annotation is unchanged.",
+                ))
+            else:
+                operations.append(Operation(
+                    type=OperationType.UPDATE,
+                    content=content,
+                    embedding_id=match.id,
+                    reason="Existing code annotation target has updated content.",
+                ))
+
+        return JudgeResult(operations=operations, confidence=1.0)


The _deterministic_code method processes items sequentially and lacks deduplication of the incoming new_items. If multiple identical annotations are extracted in a single turn, this will result in redundant operations and potential duplicate records in the vector store. It is recommended to deduplicate items by their identity key and use asyncio.gather to perform metadata lookups in parallel, maintaining consistency with the profile and temporal domains.

async def _deterministic_code(
    self, new_items: list, user_id: str,
) -> JudgeResult:
    # Deduplicate items by identity key to prevent redundant operations
    unique_items: dict[str, tuple[str, dict]] = {}
    for item in new_items:
        content = str(item)
        fields = code_annotation_fields_from_storage_content(content)
        key = code_annotation_identity_key(fields)
        unique_items[key] = (content, fields)

    async def _process_one(content: str, fields: dict) -> Operation:
        match = await self._lookup_metadata_match({
            "user_id": user_id,
            "domain": JudgeDomain.CODE.value,
            "annotation_key": code_annotation_identity_key(fields),
        })
        if match is None:
            return Operation(
                type=OperationType.ADD,
                content=content,
                reason="No code annotation with the same repo/target/type key.",
            )
        incoming_hash = code_annotation_content_hash(fields)
        existing_hash = str((match.metadata or {}).get("annotation_hash", ""))
        if incoming_hash == existing_hash:
            return Operation(
                type=OperationType.NOOP,
                content=content,
                embedding_id=match.id,
                reason="Existing code annotation is unchanged.",
            )
        else:
            return Operation(
                type=OperationType.UPDATE,
                content=content,
                embedding_id=match.id,
                reason="Existing code annotation target has updated content.",
            )

    tasks = [_process_one(c, f) for c, f in unique_items.values()]
    operations = await asyncio.gather(*tasks)
    return JudgeResult(operations=list(operations), confidence=1.0)

gemini-code-assist · 2026-05-15T01:38:02Z

+    async def _deterministic_snippet(
+        self, new_items: list, user_id: str,
+    ) -> JudgeResult:
+        operations: list[Operation] = []
+        for item in new_items:
+            content = str(item)
+            fields = snippet_fields_from_storage_content(content)
+            match = await self._lookup_metadata_match({
+                "user_id": user_id,
+                "domain": JudgeDomain.SNIPPET.value,
+                "snippet_hash": snippet_identity_hash(fields),
+            })
+
+            if match is None:
+                operations.append(Operation(
+                    type=OperationType.ADD,
+                    content=content,
+                    reason="No snippet with the same normalized code/content identity.",
+                ))
+            else:
+                operations.append(Operation(
+                    type=OperationType.NOOP,
+                    content=content,
+                    embedding_id=match.id,
+                    reason="Same snippet was already stored for this user.",
+                ))
+
+        return JudgeResult(operations=operations, confidence=1.0)


Similar to _deterministic_code, the _deterministic_snippet method should deduplicate new_items by their identity hash and parallelize the metadata lookups using asyncio.gather to improve performance and prevent duplicate operations.

async def _deterministic_snippet(
    self, new_items: list, user_id: str,
) -> JudgeResult:
    # Deduplicate items by snippet hash to prevent redundant operations
    unique_items: dict[str, tuple[str, dict]] = {}
    for item in new_items:
        content = str(item)
        fields = snippet_fields_from_storage_content(content)
        h = snippet_identity_hash(fields)
        unique_items[h] = (content, fields)

    async def _process_one(content: str, fields: dict) -> Operation:
        match = await self._lookup_metadata_match({
            "user_id": user_id,
            "domain": JudgeDomain.SNIPPET.value,
            "snippet_hash": snippet_identity_hash(fields),
        })
        if match is None:
            return Operation(
                type=OperationType.ADD,
                content=content,
                reason="No snippet with the same normalized code/content identity.",
            )
        else:
            return Operation(
                type=OperationType.NOOP,
                content=content,
                embedding_id=match.id,
                reason="Same snippet was already stored for this user.",
            )

    tasks = [_process_one(c, f) for c, f in unique_items.values()]
    operations = await asyncio.gather(*tasks)
    return JudgeResult(operations=list(operations), confidence=1.0)

hunterbastian · 2026-05-15T01:51:44Z

Pushed a follow-up in 0345c53 addressing the Gemini review notes.

What changed:
- deduplicates incoming code annotations by deterministic annotation key before lookup;
- deduplicates incoming snippets by normalized snippet hash before lookup;
- runs the resulting metadata lookups concurrently with asyncio.gather;
- added regression tests for duplicate code/snippet extractions in one deterministic judge batch.

Verification rerun locally:
- python3 -m compileall src/agents/judge.py tests/test_deterministic_memory_layer.py
- git diff --check
- uv run --extra dev pytest tests/test_deterministic_memory_layer.py tests/unit/test_schemas.py -> 14 passed
- uv run --extra dev pytest -> 46 passed
- uv run ruff check --select F401 src/agents/judge.py tests/test_deterministic_memory_layer.py

Ankit-Kotnala

@hunterbastian thanks for the update. The overall direction looks good, especially moving code/snippet judging to deterministic metadata lookups and addressing the dedupe/parallel lookup feedback.

I’d like to hold off on merging until two identity issues are fixed:

- code_annotation_identity_key() should include both target_file and target_symbol when present. Right now it uses target_symbol over target_file, so the same symbol name in different files within a repo can collide.
- snippet_identity_hash() should preserve code identity more strictly. Since stable_hash() lowercases and collapses whitespace, two different code snippets can be treated as the same snippet. For code snippets, we should hash the normalized code text without lowercasing/collapsing internal whitespace.

Please add regression tests for both cases and rerun the full CI once the workflow is approved.
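Both collisions can be reproduced in a few lines. The following is a sketch under stated assumptions: `loose_hash`, `strict_code_hash`, and `identity_key` are hypothetical stand-ins that mimic the behaviors described in this comment (lowercase plus whitespace collapse versus case- and whitespace-preserving hashing), and the key layout is illustrative rather than the PR's actual format.

```python
import hashlib
import re


def loose_hash(text: str) -> str:
    """Lowercase and collapse whitespace before hashing (the behavior objected to here)."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def strict_code_hash(code: str) -> str:
    """Keep case and internal whitespace; normalize only line endings and
    trailing whitespace (assumed rule)."""
    lines = code.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip("\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def identity_key(repo: str, kind: str, target_file: str = "", target_symbol: str = "") -> str:
    """Include both file and symbol so equal symbol names in different files stay distinct."""
    return "|".join([repo, kind, target_file, target_symbol])


# Two Python snippets that differ only in indentation, which changes their meaning.
inside_loop = "for x in xs:\n    total += x\n    print(total)"
after_loop = "for x in xs:\n    total += x\nprint(total)"

print(loose_hash(inside_loop) == loose_hash(after_loop))               # collision
print(strict_code_hash(inside_loop) == strict_code_hash(after_loop))   # distinct
print(identity_key("repo", "bug", "a.py", "login") == identity_key("repo", "bug", "b.py", "login"))
```

The indentation example is the kind of case a regression test could pin down: with whitespace collapsed, a snippet that prints inside the loop and one that prints after it hash identically.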

hunterbastian · 2026-05-16T18:14:10Z

Pushed follow-up in f51930f for the requested identity fixes.

What changed:
- code_annotation_identity_key() now includes both target_file and target_symbol, so same-symbol annotations in different files no longer collide.
- snippet_identity_hash() now hashes normalized code text without lowercasing or collapsing internal whitespace when a code snippet is present.
- Added regression coverage for both collision cases and updated the deterministic weaver expectation to the new annotation key shape.

Verification rerun locally:
- .venv/bin/python -m pytest tests/unit/test_schemas.py -q -> 6 passed
- .venv/bin/python -m pytest tests/test_deterministic_memory_layer.py -q -> 10 passed
- .venv/bin/python -m pytest -q -> 48 passed
- git diff --check -> passed

The PR Labeler check is also green on the new head.

ishaanxgupta · 2026-05-19T05:02:00Z

@hunterbastian Please start a discussion on issue #141 so that we can agree on the approach first; you can then implement it in the PR.

greptile-apps · 2026-05-23T09:23:41Z

Greptile Summary

This PR introduces deterministic identity metadata for code annotations and personal snippets, letting XMem skip LLM judgment calls when re-encountering the same code or snippet across sessions. Parsing, hashing, and Pinecone metadata construction are centralized in src/schemas/code.py, the judge routes CODE and SNIPPET domains through new deterministic paths, and the weaver uses the shared helpers instead of inline dicts.

- snippet_identity_hash uses the normalized code text (case-preserved, trailing-whitespace stripped) as the primary identity signal, falling back to a case-insensitive description+language hash when no code is present, which correctly handles cross-session duplicates like the \ -vs-\\\ encoding difference between storage and ingestion.
- code_annotation_content_hash identifies annotation changes by hashing identity key + severity + content; it currently uses the case-insensitive stable_hash, which would suppress an UPDATE when only the annotation body casing changes.
- ingest.py creates ephemeral JudgeAgent instances scoped to the right vector store for each domain, and correctly binds weaver.snippet_vector_store before both the judge and the weaver execute.

Confidence Score: 3/5

The new deterministic paths work correctly for the happy path, but a case-only change to a code annotation's body text would be silently swallowed as a no-op rather than written as an update.

The code_annotation_content_hash function hashes the annotation content through stable_hash, which collapses casing. Any annotation whose body changes purely in letter-case (e.g. correcting identifier casing in free-text, changing 'NULL' to 'null') will produce the identical hash before and after the change. The deterministic judge will classify the incoming item as NOOP and the stored annotation will remain stale. All other logic — the snippet hash path, the ingest wiring, the weaver metadata helpers, and the test coverage — looks correct.

src/schemas/code.py — specifically code_annotation_content_hash and its use of stable_hash for the content field.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/schemas/code.py | Adds deterministic identity helpers (parse, hash, metadata builders) for code annotations and snippets; `code_annotation_content_hash` uses the case-insensitive `stable_hash`, which can silently suppress updates when only annotation body casing changes. |
| src/agents/judge.py | Routes CODE and SNIPPET domains through new deterministic paths; `_lookup_metadata_match` correctly uses `asyncio.to_thread` matching the existing profile pattern, but has no fallback when `search_by_metadata` is unavailable. |
| src/pipelines/ingest.py | Replaces LLM judge calls with ephemeral deterministic JudgeAgent instances for code and snippet domains; correctly binds snippet_vector_store before both the judge and the weaver now. |
| src/pipelines/weaver.py | Replaces inline metadata dicts and local parser functions with schema helpers; embedding text for snippets is now richer (description + language + tags) instead of bare description. |
| tests/test_deterministic_memory_layer.py | Adds four integration tests covering cross-session snippet dedup, batch dedup, metadata persistence, and code annotation update detection; FakeVectorStore correctly models metadata equality search. |
| tests/unit/test_schemas.py | Adds unit tests for all new schema helpers; verifies hash stability, identity key format, and metadata field values. |

Sequence Diagram

sequenceDiagram
    participant Ingest as IngestPipeline
    participant Judge as JudgeAgent (ephemeral)
    participant Schema as schemas/code.py
    participant Store as VectorStore (Pinecone)
    participant Weaver as Weaver

    Ingest->>Judge: "arun_deterministic({domain: CODE/SNIPPET, new_items, user_id})"
    Judge->>Schema: code_annotation_fields_from_storage_content(content)
    Schema-->>Judge: fields dict
    Judge->>Schema: code_annotation_identity_key(fields) / snippet_identity_hash(fields)
    Schema-->>Judge: identity key / hash
    Judge->>Store: "search_by_metadata({user_id, domain, annotation_key/snippet_hash})"
    Store-->>Judge: SearchResult or None
    alt No match found
        Judge-->>Ingest: Operation(ADD)
    else Match found, hash unchanged
        Judge-->>Ingest: Operation(NOOP)
    else Match found, hash changed (code only)
        Judge-->>Ingest: Operation(UPDATE)
    end
    Ingest->>Weaver: execute(judge_result, domain, user_id)
    Weaver->>Schema: code_annotation_pinecone_metadata / snippet_pinecone_metadata
    Schema-->>Weaver: metadata dict (with annotation_key/annotation_hash or snippet_hash)
    Weaver->>Store: add / update with enriched metadata

Reviews (1): Last reviewed commit: "Fix deterministic code identity collisio..."
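The three branches of the diagram reduce to a small pure function. This is a minimal sketch, assuming simplified stand-in types (`Op`, `Match`, and `decide` are hypothetical names, not the project's Operation or SearchResult classes):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Op(Enum):
    ADD = "add"
    NOOP = "noop"
    UPDATE = "update"


@dataclass
class Match:
    """Stand-in for a metadata search hit (hypothetical shape)."""
    id: str
    metadata: dict = field(default_factory=dict)


def decide(match: Optional[Match], incoming_hash: str, hash_field: str = "annotation_hash") -> Op:
    """Map a lookup result to ADD, NOOP, or UPDATE."""
    if match is None:
        return Op.ADD  # nothing stored under this identity yet
    existing = str((match.metadata or {}).get(hash_field, ""))
    # Same identity and same content hash: skip. Same identity, new hash: update.
    return Op.NOOP if existing == incoming_hash else Op.UPDATE
```

For snippets the lookup itself is keyed on the hash, so in this sketch a match always carries the same hash and the UPDATE branch is never reached, which matches the "code only" label in the diagram.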

greptile-apps · 2026-05-23T09:23:45Z

+def code_annotation_content_hash(fields: dict[str, Any]) -> str:
+    return stable_hash(
+        code_annotation_identity_key(fields),
+        fields.get("severity"),
+        fields.get("content"),
+    )


code_annotation_content_hash calls stable_hash, which runs every part through normalize_lookup_text (lowercase + collapse-whitespace). A case-only change to an annotation body — e.g. correcting Auth.login → auth.login in the free-text content, or capitalizing a variable name — produces the same hash as the original, so the deterministic judge returns NOOP and silently discards the update. Code identifiers in annotation text can be case-significant.

Suggested change

-def code_annotation_content_hash(fields: dict[str, Any]) -> str:
-    return stable_hash(
-        code_annotation_identity_key(fields),
-        fields.get("severity"),
-        fields.get("content"),
-    )
+def code_annotation_content_hash(fields: dict[str, Any]) -> str:
+    return strict_hash(
+        code_annotation_identity_key(fields),
+        fields.get("severity"),
+        fields.get("content"),
+    )
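The effect of a case-only edit on a multi-part hash can be checked with a short sketch. The two functions below are assumptions that imitate the described behaviors; they are not the project's stable_hash or strict_hash, whose signatures and separators are not shown in this thread.

```python
import hashlib
import re
from typing import Optional

_SEP = "\x1f"  # unit separator keeps ("ab", "c") distinct from ("a", "bc")


def loose_hash_sketch(*parts: Optional[str]) -> str:
    """Lowercase and collapse whitespace in every part before hashing."""
    cleaned = [re.sub(r"\s+", " ", (p or "").strip().lower()) for p in parts]
    return hashlib.sha256(_SEP.join(cleaned).encode("utf-8")).hexdigest()


def strict_hash_sketch(*parts: Optional[str]) -> str:
    """Preserve case and internal whitespace; only trim the ends of each part."""
    cleaned = [(p or "").strip() for p in parts]
    return hashlib.sha256(_SEP.join(cleaned).encode("utf-8")).hexdigest()


key = "repo|bug|src/auth.py|login"  # illustrative identity key
before = "Auth.login skips the token check"
after = "auth.login skips the token check"

# Case-insensitive hashing hides the edit, so a judge comparing hashes would see no change.
print(loose_hash_sketch(key, "high", before) == loose_hash_sketch(key, "high", after))
# Case-preserving hashing surfaces it as a content change.
print(strict_hash_sketch(key, "high", before) == strict_hash_sketch(key, "high", after))
```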

greptile-apps · 2026-05-23T09:23:46Z

+    async def _lookup_metadata_match(
+        self, filters: Dict[str, Any],
+    ) -> Optional[SearchResult]:
+        if not self.vector_store:
+            return None
+        search_fn = getattr(self.vector_store, "search_by_metadata", None)
+        if search_fn is None:
+            return None
+        results = await asyncio.to_thread(search_fn, filters=filters, top_k=1)
+        return _first_match(results or [])


No fallback when search_by_metadata is absent

_lookup_metadata_match silently returns None when the injected vector store lacks a search_by_metadata method. Every incoming CODE or SNIPPET item then resolves to OperationType.ADD, so the same snippet or code annotation will be re-inserted on every session rather than deduped. The existing _fetch_similar_profile_metadata falls back to a semantic search in this case; applying the same fallback (or at least a warning) here would keep the two deterministic paths consistent and avoid silent duplicate growth.
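The lighter of the two options, a warning instead of a silent None, might look like the sketch below. It is written as a free function with a hypothetical name (`lookup_metadata_match`) and a simplified first-result rule in place of `_first_match`; the semantic-search fallback is not sketched because its interface is not shown here.

```python
import asyncio
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


async def lookup_metadata_match(vector_store: Any, filters: dict[str, Any]) -> Optional[Any]:
    """Return the first metadata match, warning when the store cannot filter by metadata."""
    if vector_store is None:
        return None
    search_fn = getattr(vector_store, "search_by_metadata", None)
    if search_fn is None:
        # Surface the degraded mode: without this, every item resolves to ADD.
        logger.warning(
            "%s has no search_by_metadata; deterministic dedup is disabled",
            type(vector_store).__name__,
        )
        return None
    # The store call is synchronous, so run it off the event loop.
    results = await asyncio.to_thread(search_fn, filters=filters, top_k=1)
    return results[0] if results else None
```

A warning does not stop duplicate growth, but it makes the condition visible in logs, so a misconfigured store is noticed before the index fills with repeated entries.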

Add deterministic code and snippet memory identity

c79a067

hunterbastian requested review from ishaanxgupta and ved015 as code owners May 15, 2026 01:35

github-actions Bot added tests pipelines agents labels May 15, 2026

gemini-code-assist Bot reviewed May 15, 2026


Deduplicate deterministic code judge inputs

0345c53

ishaanxgupta requested a review from Ankit-Kotnala May 16, 2026 13:34

Ankit-Kotnala requested changes May 16, 2026


Fix deterministic code identity collisions

f51930f

hunterbastian mentioned this pull request May 19, 2026

Redesign, Audit and Revamp Code & Snippet Agents + Judge Logic #141

Open

greptile-apps Bot reviewed May 23, 2026


hunterbastian wants to merge 3 commits into XortexAI:main from hunterbastian:codex-code-snippet-schema
